feat: quit checkpoint engine when error occurs #40
Pull Request Overview
This PR implements error handling in the checkpoint worker process to ensure graceful shutdown when errors occur during weight updates. The changes propagate exceptions from worker processes back to the parameter server, allowing coordinated error handling across all distributed ranks.
Key Changes:
- Worker processes now send exception objects to the parameter server when `run()` fails, instead of silently failing
- Parameter server synchronizes error states across all ranks before continuing, ensuring no rank proceeds if any rank encounters an error
- Process group cleanup moved to a `finally` block to ensure proper resource cleanup even when errors occur
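The first and third changes can be illustrated with a minimal sketch. It is not the code from this PR: `worker_loop`, `update_once`, the queues, and the `cleanup` callback are hypothetical stand-ins, and a thread replaces the real worker process so the example stays self-contained. It only shows the general pattern of replying with the exception object and doing cleanup in `finally`.

```python
import queue
import threading


def worker_loop(requests: queue.Queue, responses: queue.Queue, run) -> None:
    """Hypothetical worker: reply with None on success, or with the exception on failure."""
    while True:
        payload = requests.get()
        if payload is None:  # sentinel asking the worker to shut down
            break
        try:
            run(payload)
            responses.put(None)
        except Exception as exc:
            # Send the exception object back instead of failing silently, then stop.
            responses.put(exc)
            break


def update_once(payload, run, cleanup) -> None:
    """Hypothetical parameter-server side: send one request and surface worker errors."""
    requests: queue.Queue = queue.Queue()
    responses: queue.Queue = queue.Queue()
    worker = threading.Thread(target=worker_loop, args=(requests, responses, run))
    worker.start()
    try:
        requests.put(payload)
        result = responses.get(timeout=5)
        if isinstance(result, Exception):
            raise RuntimeError("worker failed during weight update") from result
    finally:
        # Cleanup runs whether or not the update succeeded.
        requests.put(None)
        worker.join(timeout=5)
        cleanup()
```

With this shape, a failing `run` surfaces on the caller as a `RuntimeError` whose `__cause__` is the original exception, and `cleanup` is still called exactly once per update.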
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| checkpoint_engine/worker.py | Added try-except block to catch exceptions during weight updates and send them back to the parameter server |
| checkpoint_engine/ps.py | Modified error handling to collect responses from all ranks and raise error if any rank failed; moved cleanup to finally block |
| tests/test_error_quit.py | Added integration test that verifies proper error propagation and process termination when worker process encounters runtime errors |
…nt-engine into fix/quit-when-error
resolve #38
- Send an `Exception` to the PS when `update_weights_from_ipc` fails while executing `run`.
- `_update_per_bucket` must confirm that no process has received an `Exception` before continuing to the next update iteration.
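The second point can be sketched as below. This is an illustration under assumptions, not the actual `_update_per_bucket` logic: `check_all_ranks` and `simulate` are hypothetical names, and the `all_gather` argument stands in for whatever collective the parameter server uses to share a per-rank failure flag.

```python
from typing import Callable, List, Optional, Sequence


def check_all_ranks(
    local_error: Optional[BaseException],
    all_gather: Callable[[bool], Sequence[bool]],
) -> None:
    """Raise on every rank if any rank reported a failure (hypothetical helper)."""
    # Each rank contributes one flag and receives the flags of all ranks.
    flags = all_gather(local_error is not None)
    failed = [rank for rank, flag in enumerate(flags) if flag]
    if failed:
        raise RuntimeError(f"weight update failed on ranks {failed}") from local_error


def simulate(errors: List[Optional[BaseException]]) -> List[int]:
    """Run the check once per simulated rank and return the ranks that stopped."""
    # Stand-in for the collective: every rank sees the same list of flags.
    flags = [err is not None for err in errors]
    stopped = []
    for rank, err in enumerate(errors):
        try:
            check_all_ranks(err, lambda _flag: flags)
        except RuntimeError:
            stopped.append(rank)
    return stopped
```

In the simulation, a single failing rank causes every rank to stop before the next iteration, which is the behaviour the description asks for; when no rank fails, all of them continue.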